2022-05-07

Introduction

  • Diabetes is a prevalent disease worldwide
  • Huge amount of data available regarding diabetes
  • Motivation: Use data to help understand which factors contribute to developing the disease

Materials

NHANES glycohemoglobin data

National Health and Nutrition Examination Survey data concerning glycohemoglobin levels and diabetes mellitus (DM) from https://hbiostat.org/data/.

Why this dataset?

  • Manageable size: 20 variables, 6795 observations
  • Wide spectrum of variables
  • Contains missing values to handle
  • Allows exploring correlations between DM diagnosis and the other variables

The data

Variable | Description                    | Units  | Levels
seqn     | Unique patient ID              |        |
sex      | Gender                         |        | 0, 1
age      | Age                            | Years  | 12 - 80
re       | Race/ethnicity                 |        | 5 levels: White, Black, Mexican, Other Hispanic, Other
income   | Family income level            | $      | 14 levels from 0 - 100000
tx       | On Insulin or Diabetes meds    |        | 0, 1
dx       | Diagnosed with DM or pre-DM    |        | 0, 1
wt       | Weight                         | kg     | 28 - 239.4
ht       | Height                         | cm     | 123.3 - 202.7
bmi      | Body-mass index                | kg/m^2 | 13.18 - 84.87
leg      | Upper leg length               | cm     | 20.4 - 50.6
arml     | Upper arm length               | cm     | 24.8 - 47
armc     | Arm circumference              | cm     | 16.8 - 61
waist    | Waist circumference            | cm     | 52 - 179
tri      | Triceps skinfold thickness     | mm     | 2.6 - 41.1
sub      | Subscapular skinfold thickness | mm     | 3.8 - 40.4
gh       | Glycohemoglobin                | %      | 4 - 16.4
albumin  | Albumin                        | g/dL   | 2.5 - 5.3
bun      | Blood urea nitrogen            | mg/dL  | 1 - 90
SCr      | Serum Creatinine               | mg/dL  | 0.14 - 15.66

Variable types

The dx variable does not differentiate between type I and type II diabetes.

Methods

Data journey

Data cleaning - Imputation of NAs

Variable | Description         | Units | Levels
income   | Family income level | $     | 14 levels from 0 - 100000

Missing income values were replaced with the mean of all non-NA income values.
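The mean imputation described above can be sketched as follows. The slides mention tidyverse (R); this plain-Python version is only an illustration, and the function name `impute_mean` and the example values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical income values; None marks a missing entry (NA in R).
income = [20000, None, 50000, 80000, None]
print(impute_mean(income))
```

Note that income has 14 discrete levels, so the imputed mean will generally not coincide with one of them.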

Variable | Description                    | Units | Levels
leg      | Upper leg length               | cm    | 20.4 - 50.6
arml     | Upper arm length               | cm    | 24.8 - 47
armc     | Arm circumference              | cm    | 16.8 - 61
waist    | Waist circumference            | cm    | 52 - 179
tri      | Triceps skinfold thickness     | mm    | 2.6 - 41.1
sub      | Subscapular skinfold thickness | mm    | 3.8 - 40.4

Missing values in these variables were imputed with k-nearest neighbours (KNN, K = 5), implemented with tidyverse. K was not optimized.
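A minimal sketch of KNN imputation, in plain Python rather than the tidyverse implementation used for the slides. The choice of predictors, the unscaled Euclidean distance and averaging over the neighbours are assumptions; the slides only state K = 5. The function name `knn_impute` and the example rows are hypothetical.

```python
import math


def knn_impute(rows, target, predictors, k=5):
    """Fill missing `target` values with the mean of the k nearest donors.

    rows: list of dicts, with None marking a missing value.
    Donors are rows where `target` is observed. Distance is Euclidean over
    `predictors`, which are assumed to be complete in every row.
    """
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("cannot impute: no row has an observed target")
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        here = [row[p] for p in predictors]
        nearest = sorted(
            donors, key=lambda d: math.dist(here, [d[p] for p in predictors])
        )[:k]
        estimate = sum(d[target] for d in nearest) / len(nearest)
        filled.append({**row, target: estimate})
    return filled


# Hypothetical rows: height (cm), weight (kg) and upper leg length (cm).
rows = [
    {"ht": 150, "wt": 50, "leg": 35.0},
    {"ht": 152, "wt": 52, "leg": 36.0},
    {"ht": 190, "wt": 95, "leg": 45.0},
    {"ht": 151, "wt": 51, "leg": None},
]
print(knn_impute(rows, target="leg", predictors=["ht", "wt"], k=2)[3])
```

In practice the predictors would usually be standardized first, so that variables on larger scales do not dominate the distance.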

Data cleaning - Removal of outliers

Biochemical variables have more outliers than the other variables.

Variable | Description      | Units | Levels
SCr      | Serum Creatinine | mg/dL | 0.14 - 15.66

The normal range is 0.6 - 1.2 mg/dL; values of 5 mg/dL and above indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total). Source: https://www.medicinenet.com/creatinine_blood_test/article.htm
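The threshold rule can be sketched as below, assuming that "removed" means dropping the whole observation and that rows with a missing SCr are kept. Both are assumptions; the function name `drop_high_scr` and the example values are hypothetical.

```python
def drop_high_scr(rows, threshold=5.0):
    """Return (kept rows, number removed); rows with SCr above `threshold` are dropped.

    Rows with a missing SCr (None) are kept, since they are not above the threshold.
    """
    kept = [r for r in rows if r["SCr"] is None or r["SCr"] <= threshold]
    return kept, len(rows) - len(kept)


# Hypothetical serum creatinine values in mg/dL.
rows = [{"SCr": 0.9}, {"SCr": 5.0}, {"SCr": 7.3}, {"SCr": None}, {"SCr": 15.66}]
kept, removed = drop_high_scr(rows)
print(len(kept), removed)
```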

Results & Discussion

Explorative data analysis

Linear correlation between numeric variables



Positive correlations occur primarily between body-size-related variables.
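As an illustration of the quantity behind such a correlation plot, the sketch below computes pairwise Pearson correlations in plain Python. It is not the code used for the slides; the helper names and the example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """columns: dict mapping variable name to a list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical body-size measurements for five individuals.
data = {
    "wt": [55.0, 62.0, 70.0, 85.0, 98.0],
    "waist": [70.0, 76.0, 84.0, 95.0, 110.0],
    "age": [60, 25, 47, 33, 52],
}
for name, row in correlation_matrix(data).items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```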

Diagnosis status across BMI class

  • Excess weight and increasing obesity levels seem to be contributing factors in the development of diabetes.
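The slides do not state which BMI classes were used. The sketch below assumes the common WHO-style cut-offs (18.5, 25 and 30 kg/m^2) and shows how the share of diagnosed individuals per class could be tabulated; the function names and example records are hypothetical.

```python
def bmi_class(bmi):
    """Assign a BMI class using assumed cut-offs of 18.5, 25 and 30 kg/m^2."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def diagnosed_share_by_class(records):
    """records: iterable of (bmi, dx) pairs with dx in {0, 1}."""
    counts = {}
    for bmi, dx in records:
        label = bmi_class(bmi)
        total, diagnosed = counts.get(label, (0, 0))
        counts[label] = (total + 1, diagnosed + dx)
    return {label: d / t for label, (t, d) in counts.items()}


# Hypothetical (bmi, dx) pairs.
records = [(22, 0), (23, 0), (27, 0), (28, 1), (33, 1), (35, 1)]
print(diagnosed_share_by_class(records))
```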

Age as a contributing factor to diagnosis across BMI class

Treatment status of different ethnicity and age

  • Older individuals tend to receive treatment more often than younger individuals.
  • No apparent association between treatment status and ethnicity.

Influence of income and ethnicity on treatment status

Annual income level and ethnicity do not seem to influence treatment status.

Serum albumin levels in relation to diagnosis

Serum albumin is lower in diagnosed than in non-diagnosed individuals.

Principal Component Analysis

Investigation of patterns concerning diagnosis of diabetes mellitus

Variables dx, tx, leg, arml, wt and ht were excluded
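To illustrate what a PCA on the remaining numeric variables computes, the sketch below standardizes each column and extracts the first principal component by power iteration on the correlation matrix. This is a plain-Python illustration, not the routine used for the slides; the standardization, the function names and the example values are assumptions.

```python
import math


def standardize(columns):
    """Scale each column to mean 0 and sample standard deviation 1.

    Columns are assumed to be non-constant (otherwise the deviation is 0).
    """
    out = []
    for col in columns:
        n = len(col)
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        out.append([(v - m) / sd for v in col])
    return out


def first_component(columns, iterations=200):
    """Return (loadings, share of variance explained) for the first PC."""
    z = standardize(columns)
    p, n = len(z), len(z[0])
    # The covariance of standardized columns is the correlation matrix.
    cov = [
        [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration; the start vector must not be orthogonal to the first PC.
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    # The trace of a correlation matrix equals the number of variables p.
    return v, eigenvalue / p


# Hypothetical values for three variables measured on five individuals.
columns = [
    [25.0, 31.0, 22.0, 40.0, 28.0],    # e.g. bmi
    [80.0, 101.0, 74.0, 125.0, 90.0],  # e.g. waist
    [5.2, 6.8, 5.0, 7.9, 5.6],         # e.g. gh
]
loadings, share = first_component(columns)
print([round(x, 3) for x in loadings], round(share, 3))
```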

Investigation of patterns in relation to BMI

Variables bmi, wt and ht were excluded

K-means clustering

Identify relevant number of clusters
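One common way to choose the number of clusters is the elbow method: run K-means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The slides do not say which criterion was used, so the sketch below is an assumption; it uses Lloyd's algorithm with a deterministic farthest-point initialisation, and all names and data are hypothetical.

```python
import math


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm; returns (centers, within-cluster sum of squares)."""
    # Deterministic initialisation: start at the first point, then repeatedly
    # add the point farthest from its nearest already chosen center.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, wcss


def wcss_curve(points, max_k):
    """WCSS for k = 1 .. max_k, the values plotted in an elbow plot."""
    return [kmeans(points, k)[1] for k in range(1, max_k + 1)]


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print([round(w, 2) for w in wcss_curve(points, 3)])
```

In this toy example the WCSS drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the "elbow" one would look for.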

Clusters between age and all other variables

Single Parameter Logistic Regression

Single-parameter models were evaluated to establish a baseline.

The precision of the BMI and GH models is 29% and 68%, respectively.
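The slides do not define "precision". Assuming the standard classification sense, precision is the fraction of predicted positives that are true positives, TP / (TP + FP). A minimal sketch with hypothetical labels and a hypothetical 0.5 probability cut-off:

```python
def precision(y_true, y_pred):
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)


# Hypothetical predicted probabilities from a single-parameter model.
probabilities = [0.9, 0.6, 0.7, 0.2, 0.1]
y_true = [1, 0, 1, 1, 0]
y_pred = [int(p >= 0.5) for p in probabilities]
print(precision(y_true, y_pred))
```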

Parameter Impact Estimation

Parameters vs P-value

Multi-Factorial Logistic Regression

The precision of this model is 80%.
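A multi-factor logistic regression models the probability of diagnosis as a logistic function of a weighted sum of several predictors. The sketch below fits such a model by batch gradient descent in plain Python. It illustrates the technique only and is not the fitting routine used for the slides; the learning rate, epoch count, function names and toy data are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w


def predict(w, X, threshold=0.5):
    """Classify each row as 1 if its predicted probability reaches `threshold`."""
    return [
        int(sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) >= threshold)
        for xi in X
    ]


# Hypothetical standardized predictors (e.g. gh and age) and diagnosis labels.
X = [(-1.0, -0.5), (-0.8, -1.0), (-0.5, -0.2), (0.5, 0.3), (0.9, 1.0), (1.2, 0.4)]
y = [0, 0, 0, 1, 1, 1]
weights = fit_logistic(X, y)
print([round(v, 2) for v in weights], predict(weights, X))
```

Predictors should be on comparable scales (for example standardized) for gradient descent with a fixed learning rate to behave well.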

Conclusion

  • Diagnosis of DM correlates with age, glycohemoglobin (gh, a marker of long-term blood glucose), BMI, albumin and, to some degree, BUN
  • Income, ethnicity and gender do not appear to predict DM diagnosis or treatment status
  • Glycohemoglobin outweighs the other variables in predicting DM diagnosis

In summary: We cannot cluster patients based on these variables alone

Idea for further research: It appears that older people with diabetes tend to be treated more often than younger people with diabetes.